Search | BVS Regional Portal

1.

Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA.

Lemane, Téo; Lezzoche, Nolan; Lecubin, Julien; Pelletier, Eric; Lescot, Magali; Chikhi, Rayan; Peterlongo, Pierre.

Nat Comput Sci ; 4(2): 104-109, 2024 Feb.

Article in English | MEDLINE | ID: mdl-38413777

ABSTRACT

Public sequencing databases contain vast amounts of biological information, yet they are largely underutilized as it is challenging to efficiently search them for any sequence(s) of interest. We present kmindex, an approach that can index thousands of metagenomes and perform sequence searches in a fraction of a second. The index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With negligible false positive rates below 0.01%, kmindex outperforms the precision of existing approaches by four orders of magnitude. Here we demonstrate the scalability of kmindex by successfully indexing 1,393 marine seawater metagenome samples from the Tara Oceans project. Additionally, we introduce the publicly accessible web server Ocean Read Atlas, which enables real-time queries on the Tara Oceans dataset.

Subjects

Genomics; Seawater; Oceans and Seas; Metagenome/genetics; Databases, Nucleic Acid

2.

fimpera: drastic improvement of Approximate Membership Query data-structures with counts.

Robidou, Lucas; Peterlongo, Pierre.

Bioinformatics ; 39(5), 2023 05 04.

Article in English | MEDLINE | ID: mdl-37195454

ABSTRACT

MOTIVATION: High throughput sequencing technologies generate massive amounts of biological sequence datasets as costs fall. One of the current algorithmic challenges for exploiting these data on a global scale consists in providing efficient query engines on these petabyte-scale datasets. Most methods indexing those datasets rely on indexing words of fixed length k, called k-mers. Many applications, such as metagenomics, require the abundance of indexed k-mers as well as their simple presence or absence, but no method scales up to petabyte-scale datasets. This deficiency is primarily because storing abundance requires explicit storage of the k-mers in order to associate them with their counts. Using counting Approximate Membership Queries (cAMQ) data structures, such as counting Bloom filters, provides a way to index large amounts of k-mers with their abundance, but at the expense of a non-negligible false positive rate. RESULTS: We propose a novel algorithm, called fimpera, that enables the improvement of any cAMQ performance. Applied to counting Bloom filters, our proposed algorithm reduces the false positive rate by two orders of magnitude and it improves the precision of the reported abundances. Alternatively, fimpera allows for the reduction of the size of a counting Bloom filter by two orders of magnitude while maintaining the same precision. fimpera does not introduce any memory overhead and may even reduce the query time. AVAILABILITY AND IMPLEMENTATION: https://github.com/lrobidou/fimpera.

Subjects

Algorithms; Software; Sequence Analysis, DNA/methods; Metagenomics; High-Throughput Nucleotide Sequencing/methods
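The counting Bloom filter mentioned in the abstract above can be illustrated with a small sketch. This is not fimpera itself, only a toy version of the kind of cAMQ it improves; the class name, the parameters `num_counters` and `num_hashes`, and the hashing scheme are assumptions made for illustration, and real implementations typically use counters of only a few bits.

```python
import hashlib


class CountingBloomFilter:
    """Toy counting Bloom filter (illustrative, not the fimpera code)."""

    def __init__(self, num_counters: int, num_hashes: int):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _positions(self, item: str):
        # Derive one counter position per seed by salting a single hash.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                f"{seed}:{item}".encode(), digest_size=8
            ).digest()
            yield int.from_bytes(digest, "big") % len(self.counters)

    def add(self, item: str, count: int = 1) -> None:
        for pos in self._positions(item):
            self.counters[pos] += count

    def abundance(self, item: str) -> int:
        # The minimum over the counters never underestimates the true count,
        # but collisions with other items can inflate it.
        return min(self.counters[pos] for pos in self._positions(item))


def kmers(sequence: str, k: int):
    """Yield every overlapping word of length k."""
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))


cbf = CountingBloomFilter(num_counters=1024, num_hashes=3)
for kmer in kmers("ACGTACGTGA", 4):
    cbf.add(kmer)

# "ACGT" occurs twice in the sequence, so its reported abundance is at least 2.
print(cbf.abundance("ACGT"))
```

A k-mer that was never inserted can still receive a non-zero abundance when all of its counters were touched by other k-mers, which is the false positive behaviour the abstract refers to.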

3.

HaploBlocks: Efficient Detection of Positive Selection in Large Population Genomic Datasets.

Kirsch-Gerweck, Benedikt; Bohnenkämper, Leonard; Henrichs, Michel T; Alanko, Jarno N; Bannai, Hideo; Cazaux, Bastien; Peterlongo, Pierre; Burger, Joachim; Stoye, Jens; Diekmann, Yoan.

Mol Biol Evol ; 40(3), 2023 03 04.

Article in English | MEDLINE | ID: mdl-36790822

ABSTRACT

Genomic regions under positive selection harbor variation linked for example to adaptation. Most tools for detecting positively selected variants have computational resource requirements rendering them impractical on population genomic datasets with hundreds of thousands of individuals or more. We have developed and implemented an efficient haplotype-based approach able to scan large datasets and accurately detect positive selection. We achieve this by combining a pattern matching approach based on the positional Burrows-Wheeler transform with model-based inference which only requires the evaluation of closed-form expressions. We evaluate our approach with simulations, and find it to be both sensitive and specific. The computational resource requirements quantified using UK Biobank data indicate that our implementation is scalable to population genomic datasets with millions of individuals. Our approach may serve as an algorithmic blueprint for the era of "big data" genomics: a combinatorial core coupled with statistical inference in closed form.

Subjects

Genetics, Population; Metagenomics; Genomics; Genome; Haplotypes

4.

kmdiff, large-scale and user-friendly differential k-mer analyses.

Lemane, Téo; Chikhi, Rayan; Peterlongo, Pierre.

Bioinformatics ; 38(24): 5443-5445, 2022 12 13.

Article in English | MEDLINE | ID: mdl-36315078

ABSTRACT

SUMMARY: Genome-wide association studies elucidate links between genotypes and phenotypes. Recent studies point out the interest of conducting such experiments using k-mers as the base signal instead of single-nucleotide polymorphisms. We propose a tool, kmdiff, that performs differential k-mer analyses on large sequencing cohorts in an order of magnitude less time and memory than previously possible. AVAILABILITY AND IMPLEMENTATION: https://github.com/tlemane/kmdiff. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms; Software; Sequence Analysis, DNA; Genome-Wide Association Study; Genotype

5.

Genomic evidence for global ocean plankton biogeography shaped by large-scale current systems.

Richter, Daniel J; Watteaux, Romain; Vannier, Thomas; Leconte, Jade; Frémont, Paul; Reygondeau, Gabriel; Maillet, Nicolas; Henry, Nicolas; Benoit, Gaëtan; Da Silva, Ophélie; Delmont, Tom O; Fernàndez-Guerra, Antonio; Suweis, Samir; Narci, Romain; Berney, Cédric; Eveillard, Damien; Gavory, Frederick; Guidi, Lionel; Labadie, Karine; Mahieu, Eric; Poulain, Julie; Romac, Sarah; Roux, Simon; Dimier, Céline; Kandels, Stefanie; Picheral, Marc; Searson, Sarah; Pesant, Stéphane; Aury, Jean-Marc; Brum, Jennifer R; Lemaitre, Claire; Pelletier, Eric; Bork, Peer; Sunagawa, Shinichi; Lombard, Fabien; Karp-Boss, Lee; Bowler, Chris; Sullivan, Matthew B; Karsenti, Eric; Mariadassou, Mahendra; Probert, Ian; Peterlongo, Pierre; Wincker, Patrick; de Vargas, Colomban; Ribera d'Alcalà, Maurizio; Iudicone, Daniele; Jaillon, Olivier.

Elife ; 11, 2022 08 03.

Article in English | MEDLINE | ID: mdl-35920817

ABSTRACT

Biogeographical studies have traditionally focused on readily visible organisms, but recent technological advances are enabling analyses of the large-scale distribution of microscopic organisms, whose biogeographical patterns have long been debated. Here we assessed the global structure of plankton geography and its relation to the biological, chemical, and physical context of the ocean (the 'seascape') by analyzing metagenomes of plankton communities sampled across oceans during the Tara Oceans expedition, in light of environmental data and ocean current transport. Using a consistent approach across organismal sizes that provides unprecedented resolution to measure changes in genomic composition between communities, we report a pan-ocean, size-dependent plankton biogeography overlying regional heterogeneity. We found robust evidence for a basin-scale impact of transport by ocean currents on plankton biogeography, and on a characteristic timescale of community dynamics going beyond simple seasonality or life history transitions of plankton.

Oceans are brimming with life invisible to our eyes, a myriad of species of bacteria, viruses and other microscopic organisms essential for the health of the planet. These 'marine plankton' are unable to swim against currents and should therefore be constantly on the move, yet previous studies have suggested that distinct species of plankton may in fact inhabit different oceanic regions. However, proving this theory has been challenging; collecting plankton is logistically difficult, and it is often impossible to distinguish between species simply by examining them under a microscope. However, within the last decade, a research schooner called Tara has travelled the globe to gather thousands of plankton samples. At the same time, advances in genomics have made it possible to identify species based only on fragments of their DNA sequence. To understand the hidden geography of plankton communities in Earth's oceans, Richter et al. pored over DNA from the Tara Oceans expedition. This revealed that, despite being unable to resist the flow of water, various planktonic species which live close to the surface manage to occupy distinct, stable provinces shaped by currents. Different sizes of plankton are distributed in different sized provinces, with the smallest organisms tending to inhabit the smallest areas. Comparing DNA similarities and speeds of currents at the ocean surface revealed how these might stretch and mix plankton communities. Plankton play a critical role in the health of the ocean and the chemical cycles of planet Earth. These results could allow deeper investigation by marine modellers, ecologists, and evolutionary biologists. Meanwhile, work is already underway to investigate how climate change might impact this hidden geography.

Subjects

Ecosystem; Plankton; Genomics; Geography; Oceans and Seas; Plankton/genetics

6.

The K-mer File Format: a standardized and compact disk representation of sets of k-mers.

Dufresne, Yoann; Lemane, Teo; Marijon, Pierre; Peterlongo, Pierre; Rahman, Amatur; Kokot, Marek; Medvedev, Paul; Deorowicz, Sebastian; Chikhi, Rayan.

Bioinformatics ; 38(18): 4423-4425, 2022 09 15.

Article in English | MEDLINE | ID: mdl-35904548

ABSTRACT

SUMMARY: Bioinformatics applications increasingly rely on ad hoc disk storage of k-mer sets, e.g. for de Bruijn graphs or alignment indexes. Here, we introduce the K-mer File Format as a general lossless framework for storing and manipulating k-mer sets, realizing space savings of 3-5× compared to other formats, and bringing interoperability across tools. AVAILABILITY AND IMPLEMENTATION: Format specification, C++/Rust API, tools: https://github.com/Kmer-File-Format/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms; Software; Sequence Analysis, DNA; Compact Disks

7.

kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections.

Lemane, Téo; Medvedev, Paul; Chikhi, Rayan; Peterlongo, Pierre.

Bioinform Adv ; 2(1): vbac029, 2022.

Article in English | MEDLINE | ID: mdl-36699393

ABSTRACT

Summary: When indexing large collections of short-read sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers which approximates the desired set of all the non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially in the case of metagenomics data. Erroneous but abundant k-mers are wrongly included, and non-erroneous but low-abundant ones are wrongly discarded. We propose kmtricks, a novel approach for generating Bloom filters from terabase-sized collections of sequencing data. Our main contributions are (i) an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction by directly counting, partitioning and sorting hashes instead of k-mers, which is approximately four times faster than state-of-the-art tools; (ii) a novel technique that takes advantage of joint counting to preserve low-abundant k-mers present in several samples, improving the recovery of non-erroneous k-mers. Our experiments highlight that this technique preserves around 8× more k-mers than the usual yet crude filtering of low-abundance k-mers in a large metagenomics dataset. Availability and implementation: https://github.com/tlemane/kmtricks. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
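The idea of using joint counts to preserve low-abundance k-mers that recur across samples can be sketched as follows. This is a minimal illustration under stated assumptions, not the kmtricks implementation: the function name and the thresholds `hard_min`, `soft_min` and `min_support` are hypothetical, and the exact rule used by the tool is not given in the abstract.

```python
def rescue_filter(counts, hard_min=2, soft_min=1, min_support=2):
    """Decide, per sample, which k-mers to keep.

    counts maps each k-mer to its list of per-sample abundances.
    A k-mer is kept in a sample if its abundance there reaches hard_min,
    or if it reaches soft_min there and is solid (>= hard_min) in at
    least min_support other samples. All thresholds are illustrative.
    """
    kept = {}
    for kmer, per_sample in counts.items():
        solid = [c >= hard_min for c in per_sample]
        n_solid = sum(solid)
        row = []
        for is_solid, c in zip(solid, per_sample):
            if is_solid:
                row.append(True)
            else:
                # This sample is not solid, so every solid sample is another one.
                row.append(c >= soft_min and n_solid >= min_support)
        kept[kmer] = row
    return kept


counts = {
    "ACGT": [5, 1, 3],  # weak in sample 1 but solid in two other samples
    "TTTT": [1, 0, 0],  # weak everywhere, likely a sequencing error
    "GGCA": [1, 1, 4],  # solid in one sample only
}
kept = rescue_filter(counts)
print(kept["ACGT"])
```

A plain per-sample threshold would drop "ACGT" from the second sample; the joint rule keeps it because the same k-mer is well supported elsewhere.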

8.

StrainFLAIR: strain-level profiling of metagenomic samples using variation graphs.

Da Silva, Kévin; Pons, Nicolas; Berland, Magali; Plaza Oñate, Florian; Almeida, Mathieu; Peterlongo, Pierre.

PeerJ ; 9: e11884, 2021.

Article in English | MEDLINE | ID: mdl-34513324

ABSTRACT

Current studies are shifting from the use of single linear references to representation of multiple genomes organised in pangenome graphs or variation graphs. Meanwhile, in metagenomic samples, resolving strain-level abundances is a major step in microbiome studies, as associations between strain variants and phenotype are of great interest for diagnostic and therapeutic purposes. We developed StrainFLAIR with the aim of showing the feasibility of using variation graphs for indexing highly similar genomic sequences up to the strain level, and for characterizing a set of unknown sequenced genomes by querying this graph. On simulated data composed of mixtures of strains from the same bacterial species Escherichia coli, results show that StrainFLAIR was able to distinguish and estimate the abundances of close strains, as well as to highlight the presence of a new strain close to a referenced one and to estimate its abundance. On a real dataset composed of a mix of several bacterial species and several strains for the same species, results show that in a more complex configuration StrainFLAIR correctly estimates the abundance of each strain. Hence, results demonstrated how graph representation of multiple close genomes can be used as a reference to characterize a sample at the strain level.

9.

metaVaR: Introducing metavariant species models for reference-free metagenomic-based population genomics.

Laso-Jadart, Romuald; Ambroise, Christophe; Peterlongo, Pierre; Madoui, Mohammed-Amin.

PLoS One ; 15(12): e0244637, 2020.

Article in English | MEDLINE | ID: mdl-33378381

ABSTRACT

The availability of large metagenomic data offers great opportunities for the population genomic analysis of uncultured organisms, which represent a large part of the unexplored biosphere and play a key ecological role. However, the majority of these organisms lack a reference genome or transcriptome, which constitutes a technical obstacle for classical population genomic analyses. We introduce the metavariant species (MVS) model, in which a species is represented only by intra-species nucleotide polymorphism. We designed a method combining reference-free variant calling, multiple density-based clustering and maximum-weighted independent set algorithms to cluster intra-species variants into MVSs directly from multisample metagenomic raw reads without a reference genome or read assembly. The frequencies of the MVS variants are then used to compute population genomic statistics such as FST, in order to estimate genomic differentiation between populations and to identify loci under natural selection. The MVS construction was tested on simulated and real metagenomic data. MVSs showed the required quality for robust population genomics and allowed an accurate estimation of genomic differentiation (ΔFST < 0.0001 and <0.03 on simulated and real data respectively). Loci predicted under natural selection on real data were all detected by MVSs. MVSs represent a new paradigm that may simplify and enhance holistic approaches for population genomics and the evolution of microorganisms.

Subjects

Computational Biology/methods; Genetic Variation; Metagenomics/methods; Cluster Analysis; Genetics, Population; Models, Genetic; Selection, Genetic; Software
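The abstract above uses variant frequencies to compute statistics such as FST. A minimal sketch of one textbook definition, FST = (HT - HS) / HT for a single biallelic locus and two equally weighted populations, is given below. The estimator actually used by metaVaR is not stated in the abstract, so the function and its name are assumptions for illustration only.

```python
def fst_biallelic(p1: float, p2: float) -> float:
    """FST at one biallelic locus from the allele frequency in two populations."""
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity of the pooled population.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # same allele fixed in both populations: no differentiation
    # Mean expected heterozygosity within each population.
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


print(fst_biallelic(0.5, 0.5))  # identical frequencies give 0.0
print(fst_biallelic(0.0, 1.0))  # alternative fixed alleles give 1.0
```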

10.

Investigating population-scale allelic differential expression in wild populations of Oithona similis (Cyclopoida, Claus, 1866).

Laso-Jadart, Romuald; Sugier, Kevin; Petit, Emmanuelle; Labadie, Karine; Peterlongo, Pierre; Ambroise, Christophe; Wincker, Patrick; Jamet, Jean-Louis; Madoui, Mohammed-Amin.

Ecol Evol ; 10(16): 8894-8905, 2020 Aug.

Article in English | MEDLINE | ID: mdl-32884665

ABSTRACT

Acclimation allowed by variation in gene or allele expression in natural populations is increasingly understood as a decisive mechanism, as much as adaptation, for species evolution. However, for small eukaryotic organisms, such as zooplankton species, classical methods face numerous challenges. Here, we propose the concept of allelic differential expression at the population-scale (psADE) to investigate the variation in allele expression in natural populations. We developed a novel approach to detect psADE based on metagenomic and metatranscriptomic data from environmental samples. This approach was applied to the widespread marine copepod, Oithona similis, by combining samples collected during the Tara Oceans expedition (2009-2013) and de novo transcriptome assemblies. Among a total of 25,768 single nucleotide variants (SNVs) of O. similis, 572 (2.2%) were affected by psADE in at least one population (FDR < 0.05). The distribution of SNVs under psADE in different populations is significantly shaped by population genomic differentiation (Pearson r = 0.87, p = 5.6 × 10^-30), supporting a partial genetic control of psADE. Moreover, a significant number of SNVs (0.6%) were under both selection and psADE (p < .05), supporting the hypothesis that natural selection and psADE tend to impact common loci. Population-scale allelic differential expression offers new insights into the gene regulation control in populations and its link with natural selection.

11.

DiscoSnp-RAD: de novo detection of small variants for RAD-Seq population genomics.

Gauthier, Jérémy; Mouden, Charlotte; Suchan, Tomasz; Alvarez, Nadir; Arrigo, Nils; Riou, Chloé; Lemaitre, Claire; Peterlongo, Pierre.

PeerJ ; 8: e9291, 2020.

Article in English | MEDLINE | ID: mdl-32566401

ABSTRACT

Restriction site Associated DNA Sequencing (RAD-Seq) is a technique characterized by the sequencing of specific loci along the genome that is widely employed in the field of evolutionary biology, since it allows variant information (mainly Single Nucleotide Polymorphisms, SNPs) to be exploited from entire populations at a reduced cost. Common RAD dedicated tools, such as STACKS or IPyRAD, are based on all-vs-all read alignments, which require considerable time and computing resources. We present an original method, DiscoSnp-RAD, that avoids this pitfall since variants are detected by exploiting specific parts of the assembly graph built from the reads, hence preventing all-vs-all read alignments. We tested the implementation on simulated datasets of increasing size, up to 1,000 samples, and on real RAD-Seq data from 259 specimens of Chiastocheta flies, morphologically assigned to seven species. All individuals were successfully assigned to their species using both STRUCTURE and Maximum Likelihood phylogenetic reconstruction. Moreover, identified variants succeeded in revealing a within-species genetic structure linked to the geographic distribution. Furthermore, our results show that DiscoSnp-RAD is significantly faster than state-of-the-art tools. The overall results show that DiscoSnp-RAD is suitable to identify variants from RAD-Seq data, it does not require time-consuming parameterization steps and it stands out from other tools due to its completely different principle, making it substantially faster, in particular on large datasets.

12.

SVJedi: genotyping structural variations with long reads.

Lecompte, Lolita; Peterlongo, Pierre; Lavenier, Dominique; Lemaitre, Claire.

Bioinformatics ; 36(17): 4568-4575, 2020 11 01.

Article in English | MEDLINE | ID: mdl-32437523

ABSTRACT

MOTIVATION: Studies on structural variants (SVs) are expanding rapidly. As a result, and thanks to third generation sequencing technologies, the number of discovered SVs is increasing, especially in the human genome. At the same time, for several applications such as clinical diagnoses, it is important to genotype newly sequenced individuals on well-defined and characterized SVs. Whereas several SV genotypers have been developed for short read data, there is a lack of dedicated tools to assess whether known SVs are present or not in a new long read sequenced sample, such as those produced by Pacific Biosciences or Oxford Nanopore Technologies. RESULTS: We present a novel method to genotype known SVs from long read sequencing data. The method is based on the generation of a set of representative allele sequences that represent the two alleles of each structural variant. Long reads are aligned to these allele sequences. Alignments are then analyzed and filtered to keep only informative ones, to quantify and estimate the presence of each SV allele and the allele frequencies. We provide an implementation of the method, SVJedi, to genotype SVs with long reads. The tool has been applied to both simulated and real human datasets and achieves high genotyping accuracy. We show that SVJedi obtains better performance than other existing long read genotyping tools and we also demonstrate that SV genotyping is considerably improved with SVJedi compared to other approaches, namely SV discovery and short read SV genotyping approaches. AVAILABILITY AND IMPLEMENTATION: https://github.com/llecompte/SVJedi.git. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Genome, Human; Software; Genomic Structural Variation; Genotype; High-Throughput Nucleotide Sequencing; Humans; Sequence Analysis, DNA

13.

Cytosine methylation of mature microRNAs inhibits their functions and is associated with poor prognosis in glioblastoma multiforme.

Cheray, Mathilde; Etcheverry, Amandine; Jacques, Camille; Pacaud, Romain; Bougras-Cartron, Gwenola; Aubry, Marc; Denoual, Florent; Peterlongo, Pierre; Nadaradjane, Arulraj; Briand, Joséphine; Akcha, Farida; Heymann, Dominique; Vallette, François M; Mosser, Jean; Ory, Benjamin; Cartron, Pierre-François.

Mol Cancer ; 19(1): 36, 2020 02 25.

Article in English | MEDLINE | ID: mdl-32098627

ABSTRACT

BACKGROUND: Literature reports that mature microRNA (miRNA) can be methylated at adenosine, guanosine and cytosine. However, the molecular mechanisms involved in cytosine methylation of miRNAs have not yet been fully elucidated. Here we investigated the biological role and underlying mechanism of cytosine methylation in miRNAs in glioblastoma multiforme (GBM). METHODS: RNA immunoprecipitation with the anti-5methylcytosine (5mC) antibody followed by Array, ELISA, dot blot, incorporation of a radio-labelled methyl group in miRNA, and miRNA bisulfite sequencing were performed to detect the cytosine methylation in mature miRNA. Cross-Linking immunoprecipitation qPCR, transfection with methylated/unmethylated mimic miRNA, luciferase promoter reporter plasmid, Biotin-tagged 3'UTR/mRNA or miRNA experiments and in vivo assays were used to investigate the role of methylated miRNAs. Finally, the prognostic value of methylated miRNAs was analyzed in a cohort of GBM patients. RESULTS: Our study reveals that a significant fraction of miRNAs contains 5mC. Cellular experiments show that DNMT3A/AGO4 methylated miRNAs at cytosine residues inhibit the formation of miRNA/mRNA duplex, leading to the loss of their repressive function towards gene expression. In vivo experiments show that cytosine-methylation of miRNA abolishes the tumor suppressor function of miRNA-181a-5p, for example. Our study also reveals that cytosine-methylation of miRNA-181a-5p is associated with a poor prognosis in GBM patients. CONCLUSION: Together, our results indicate that the DNMT3A/AGO4-mediated cytosine methylation of miRNA negatively.

Subjects

Biomarkers, Tumor/genetics; Cytosine/chemistry; DNA Methylation; Glioblastoma/pathology; MicroRNAs/genetics; Animals; Apoptosis; Argonaute Proteins/genetics; Argonaute Proteins/metabolism; Cell Proliferation; DNA (Cytosine-5-)-Methyltransferases/genetics; DNA (Cytosine-5-)-Methyltransferases/metabolism; DNA Methyltransferase 3A; Eukaryotic Initiation Factors/genetics; Eukaryotic Initiation Factors/metabolism; Gene Expression Regulation, Neoplastic; Glioblastoma/genetics; Glioblastoma/metabolism; Humans; Mice; Mice, Nude; Prognosis; Promoter Regions, Genetic; Survival Rate; Tumor Cells, Cultured; Xenograft Model Antitumor Assays

14.

Finding all maximal perfect haplotype blocks in linear time.

Alanko, Jarno; Bannai, Hideo; Cazaux, Bastien; Peterlongo, Pierre; Stoye, Jens.

Algorithms Mol Biol ; 15: 2, 2020.

Article in English | MEDLINE | ID: mdl-32055252

ABSTRACT

Recent large-scale community sequencing efforts allow at an unprecedented level of detail the identification of genomic regions that show signatures of natural selection. Traditional methods for identifying such regions from individuals' haplotype data, however, require excessive computing times and therefore are not applicable to current datasets. In 2019, Cunha et al. (Advances in bioinformatics and computational biology: 11th Brazilian symposium on bioinformatics, BSB 2018, Niterói, Brazil, October 30 - November 1, 2018, Proceedings, 2018. 10.1007/978-3-030-01722-4_3) suggested the maximal perfect haplotype block as a very simple combinatorial pattern, forming the basis of a new method to perform rapid genome-wide selection scans. The algorithm they presented for identifying these blocks, however, had a worst-case running time quadratic in the genome length. It was posed as an open problem whether an optimal, linear-time algorithm exists. In this paper we give two algorithms that achieve this time bound, one conceptually very simple one using suffix trees and a second one using the positional Burrows-Wheeler Transform, that is very efficient also in practice.
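The positional Burrows-Wheeler Transform mentioned above rests on keeping, for each site, the haplotypes sorted by their reversed prefixes, so that haplotypes sharing a long match ending at that site sit next to each other. The sketch below builds only these positional prefix arrays with the standard stable-partition step; it is not the block-finding algorithm of the paper, and the function name and the 0/1 list encoding are assumptions for illustration.

```python
def positional_prefix_arrays(haplotypes):
    """Return, for each site k, the haplotype indices ordered by reversed prefix.

    haplotypes is a list of equal-length lists of 0/1 alleles. The order at
    site k is the order obtained by sorting haplotypes on alleles k, k-1, ..., 0.
    """
    order = list(range(len(haplotypes)))
    n_sites = len(haplotypes[0]) if haplotypes else 0
    arrays = []
    for k in range(n_sites):
        # Stable partition of the previous order on the allele at site k.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
        arrays.append(order)
    return arrays


haps = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]
arrays = positional_prefix_arrays(haps)
print(arrays[-1])  # order after the last site: [3, 0, 1, 2]
```

Each site is processed in time linear in the number of haplotypes, which is what makes scans over the whole panel practical.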

15.

Toward perfect reads: self-correction of short reads via mapping on de Bruijn graphs.

Limasset, Antoine; Flot, Jean-François; Peterlongo, Pierre.

Bioinformatics ; 36(5): 1374-1381, 2020 03 01.

Article in English | MEDLINE | ID: mdl-30785192

ABSTRACT

MOTIVATION: Short-read accuracy is important for downstream analyses such as genome assembly and hybrid long-read correction. Despite much work on short-read correction, present-day correctors either do not scale well on large datasets or consider reads as mere suites of k-mers, without taking into account their full-length sequence information. RESULTS: We propose a new method to correct short reads using de Bruijn graphs and implement it as a tool called Bcool. As a first step, Bcool constructs a compacted de Bruijn graph from the reads. This graph is filtered on the basis of k-mer abundance then of unitig abundance, thereby removing most sequencing errors. The cleaned graph is then used as a reference on which the reads are mapped to correct them. We show that this approach yields more accurate reads than k-mer-spectrum correctors while being scalable to human-size genomic datasets and beyond. AVAILABILITY AND IMPLEMENTATION: The implementation is open source, available at http://github.com/Malfoy/BCOOL under the Affero GPL license and as a Bioconda package. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms; High-Throughput Nucleotide Sequencing; Genome, Human; Humans; Sequence Analysis, DNA; Software

16.

ELECTOR: evaluator for long reads correction methods.

Marchet, Camille; Morisse, Pierre; Lecompte, Lolita; Lefebvre, Arnaud; Lecroq, Thierry; Peterlongo, Pierre; Limasset, Antoine.

NAR Genom Bioinform ; 2(1): lqz015, 2020 Mar.

Article in English | MEDLINE | ID: mdl-33575566

ABSTRACT

The error rates of third-generation sequencing data have been capped above 5%, mainly containing insertions and deletions. As a result, an increasing number of diverse long reads correction methods have been proposed. The quality of the correction has huge impacts on downstream processes. Therefore, developing methods that evaluate error correction tools with precise and reliable statistics is a crucial need. These evaluation methods rely on costly alignments to evaluate the quality of the corrected reads. Thus, key features must allow the fast comparison of different tools, and scale to the increasing length of the long reads. Our tool, ELECTOR, evaluates long reads correction and is directly compatible with a wide range of error correction tools. As it is based on multiple sequence alignment, we introduce a new algorithmic strategy for alignment segmentation, which enables us to scale to large instances using reasonable resources. To our knowledge, we provide the only method that can produce reproducible correction benchmarks on the latest ultra-long reads (>100 k bases). It is also faster than the current state-of-the-art on other datasets and provides a wider set of metrics to assess the read quality improvement after correction. ELECTOR is available on GitHub (https://github.com/kamimrcht/ELECTOR) and Bioconda.

17.

SimkaMin: fast and resource frugal de novo comparative metagenomics.

Benoit, Gaëtan; Mariadassou, Mahendra; Robin, Stéphane; Schbath, Sophie; Peterlongo, Pierre; Lemaitre, Claire.

Bioinformatics ; 36(4): 1275-1276, 2020 02 15.

Article in English | MEDLINE | ID: mdl-31504187

ABSTRACT

MOTIVATION: De novo comparative metagenomics is one of the most straightforward ways to analyze large sets of metagenomic data. Latest methods use the fraction of shared k-mers to estimate genomic similarity between read sets. However, those methods, while extremely efficient, are still limited by computational needs for practical usage outside of large computing facilities. RESULTS: We present SimkaMin, a quick comparative metagenomics tool with low disk and memory footprints, thanks to an efficient data subsampling scheme used to estimate Bray-Curtis and Jaccard dissimilarities. One billion metagenomic reads can be analyzed in <3 min, with tiny memory (1.09 GB) and disk (≈0.3 GB) requirements and without altering the quality of the downstream comparative analyses, making SimkaMin a tool perfectly tailored for very large-scale metagenomic projects. AVAILABILITY AND IMPLEMENTATION: https://github.com/GATB/simka. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Metagenomics; Software; Algorithms; Genomics; Metagenome; Sequence Analysis, DNA
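The two dissimilarities named in the abstract above can be computed from k-mer counts as sketched below, together with a simple hash-based subsampling step. The subsampling rule shown (keep k-mers whose hash is divisible by a hypothetical `modulus`) is a stand-in chosen for brevity; the abstract does not describe SimkaMin's actual scheme, and all function names are assumptions.

```python
import hashlib
from collections import Counter


def kmer_counts(reads, k):
    """Count every overlapping k-mer across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def subsample(counts, modulus):
    """Keep k-mers whose hash is divisible by modulus (same rule for all samples)."""
    def kmer_hash(kmer):
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")
    return Counter(
        {kmer: c for kmer, c in counts.items() if kmer_hash(kmer) % modulus == 0}
    )


def jaccard_dissimilarity(a, b):
    """1 - |shared distinct k-mers| / |all distinct k-mers|."""
    union = set(a) | set(b)
    if not union:
        return 0.0
    return 1.0 - len(set(a) & set(b)) / len(union)


def bray_curtis_dissimilarity(a, b):
    """1 - 2 * sum(min abundance of shared k-mers) / (total abundance)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total


a = kmer_counts(["ACGTAC"], 3)  # ACG, CGT, GTA, TAC
b = kmer_counts(["ACGTTT"], 3)  # ACG, CGT, GTT, TTT
print(jaccard_dissimilarity(a, b))      # 2 shared of 6 distinct k-mers
print(bray_curtis_dissimilarity(a, b))  # 0.5
```

Because the same hash rule is applied to every sample, a k-mer is either retained everywhere or nowhere, so the dissimilarities computed on the subsample estimate those of the full data.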

18.

Toward perfect reads: self-correction of short reads via mapping on de Bruijn graphs.

Limasset, Antoine; Flot, Jean-François; Peterlongo, Pierre.

Bioinformatics ; 36(2): 651, 2020 01 15.

Article in English | MEDLINE | ID: mdl-31808510

19.

Peptimapper: proteogenomics workflow for the expert annotation of eukaryotic genomes.

Guillot, Laetitia; Delage, Ludovic; Viari, Alain; Vandenbrouck, Yves; Com, Emmanuelle; Ritter, Andrés; Lavigne, Régis; Marie, Dominique; Peterlongo, Pierre; Potin, Philippe; Pineau, Charles.

BMC Genomics ; 20(1): 56, 2019 Jan 17.

Article in English | MEDLINE | ID: mdl-30654742

ABSTRACT

BACKGROUND: Accurate structural annotation of genomes is still a challenge, despite the progress made over the past decade. The prediction of gene structure remains difficult, especially for eukaryotic species, and is often erroneous and incomplete. We used a proteogenomics strategy, taking advantage of the combination of proteomics datasets and bioinformatics tools, to identify novel protein-coding genes and splice isoforms, assign correct start sites, and validate predicted exons and genes. RESULTS: Our proteogenomics workflow, Peptimapper, was applied to the genome annotation of Ectocarpus sp., a key reference genome for both the brown algal lineage and stramenopiles. We generated proteomics data from various life cycle stages of Ectocarpus sp. strains and sub-cellular fractions using a shotgun approach. First, we directly generated peptide sequence tags (PSTs) from the proteomics data. Second, we mapped PSTs onto the translated genomic sequence. Closely located hits (i.e., PST locations on the genome) were then clustered to detect potential coding regions based on parameters optimized for the organism. Third, we evaluated each cluster and compared it to gene predictions from existing conventional genome annotation approaches. Finally, we integrated cluster locations into GFF files to use a genome viewer. We identified two potential novel genes, a ribosomal protein L22 and an aryl sulfotransferase, and corrected the gene structure of a dihydrolipoamide acetyltransferase. We experimentally validated the results by RT-PCR and using transcriptomics data. CONCLUSIONS: Peptimapper is a complementary tool for the expert annotation of genomes. It is suitable for any organism and is distributed through a Docker image available on two public bioinformatics Docker repositories: Docker Hub and BioShaDock. This workflow is also accessible through the Galaxy framework and for use by non-computer scientists at https://galaxy.protim.eu . Data are available via ProteomeXchange under identifier PXD010618.

Subjects

Eukaryota/genetics, Genome, Molecular Sequence Annotation, Proteogenomics/methods, Software, Workflow, Amino Acid Sequence, Codon/genetics, Mass Spectrometry, Peptides/chemistry, Peptides/metabolism, Reproducibility of Results

20.

Discovering millions of plankton genomic markers from the Atlantic Ocean and the Mediterranean Sea.

Arif, Majda; Gauthier, Jérémy; Sugier, Kevin; Iudicone, Daniele; Jaillon, Olivier; Wincker, Patrick; Peterlongo, Pierre; Madoui, Mohammed-Amin.

Mol Ecol Resour ; 19(2): 526-535, 2019 Mar.

Article in English | MEDLINE | ID: mdl-30575285

ABSTRACT

Comparison of the molecular diversity in all plankton populations present in geographically distant water columns may allow for a holistic view of the connectivity, isolation and adaptation of organisms in the marine environment. In this context, large-scale detection and analysis of genomic variants directly in metagenomic data appeared to be a powerful strategy for the identification of genetic structures and genes under natural selection in plankton. Here, we used discosnp++, a reference-free variant caller, to produce genetic variants from large-scale metagenomic data and assessed its accuracy on the copepod Oithona nana in terms of variant calling, allele frequency estimation and population genomic statistics by comparing it to the state-of-the-art method. discosnp++ produces variants leading to similar conclusions regarding the genetic structure and identification of loci under natural selection. discosnp++ was then applied to 120 metagenomic samples from four size fractions, including prokaryotes, protists and zooplankton, sampled from 39 Tara Oceans sampling stations located in the Atlantic Ocean and the Mediterranean Sea, to produce a new set of marine genomic markers containing more than 19 million variants. This new genomic resource can be used by the community to relocate these markers on their plankton genomes or transcriptomes of interest. This resource will be updated as new marine expeditions take place and more metagenomic data become available (availability: http://bioinformatique.rennes.inria.fr/taravariants/).

Subjects

Aquatic Organisms/classification, Genetic Markers, Genetics, Population/methods, Genotyping Techniques/methods, Metagenomics/methods, Plankton/genetics, Animals, Aquatic Organisms/genetics, Atlantic Ocean, Mediterranean Sea
